1 . regress std_log_displaced severity std_log_affected std_log_duration
Source | SS df MS Number of obs = 3034
-------------+------------------------------ F( 3, 3030) = 239.40
Model | 581.158345 3 193.719448 Prob > F = 0.0000
Residual | 2451.84164 3030 .809188658 R-squared = 0.1916
-------------+------------------------------ Adj R-squared = 0.1908
Total | 3032.99998 3033 .999999994 Root MSE = .89955
----------------------------------------------------------------------------------
std_log_displa~d | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------+----------------------------------------------------------------
severity | .1240879 .0421517 2.94 0.003 .0414392 .2067367
std_log_affected | .1801437 .0188169 9.57 0.000 .1432485 .217039
std_log_duration | .3070196 .0184443 16.65 0.000 .2708549 .3431842
_cons | -.2192853 .0545864 -4.02 0.000 -.3263154 -.1122551
----------------------------------------------------------------------------------
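The summary statistics in the header of this table follow directly from the sums of squares, which gives a quick consistency check on the output. The sketch below is a Python re-computation (not the authors' Stata code); the variable names are illustrative, and the numbers are copied from the table above.

```python
import math

# Values copied from the ANOVA panel of the three-predictor model above.
ss_model, df_model = 581.158345, 3
ss_resid, df_resid = 2451.84164, 3030

ss_total = ss_model + ss_resid
n_obs = df_model + df_resid + 1          # 3034 observations

ms_model = ss_model / df_model           # mean square of the model
ms_resid = ss_resid / df_resid           # mean square of the residuals

r_squared = ss_model / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
f_stat = ms_model / ms_resid
root_mse = math.sqrt(ms_resid)

print(f"R-squared     = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
print(f"F({df_model}, {df_resid})   = {f_stat:.2f}")
print(f"Root MSE      = {root_mse:.5f}")
```

The recomputed values should agree with the reported header to the displayed precision.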
Next, we proceeded to check if the assumptions for linear regression hold:
No evidence of significant collinearity (VIF <10).
2 . vif
Variable | VIF 1/VIF
-------------+----------------------
std_log_du~n | 1.30 0.766732
std_log_af~d | 1.29 0.774600
severity | 1.06 0.945632
-------------+----------------------
Mean VIF | 1.22
Assumption of linearity appears satisfied.
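For reference, the VIF of a predictor is 1/(1 - R_j^2), where R_j^2 comes from regressing that predictor on the others, and the 1/VIF column is the corresponding tolerance. A minimal Python sketch (illustrative names, values copied from the VIF table above) recovers the VIFs and the implied auxiliary R^2 values:

```python
# Tolerances (the 1/VIF column) copied from the VIF table above.
tolerance = {
    "std_log_duration": 0.766732,
    "std_log_affected": 0.774600,
    "severity": 0.945632,
}

# VIF_j = 1 / (1 - R_j^2) and tolerance_j = 1 - R_j^2.
vif = {name: 1.0 / tol for name, tol in tolerance.items()}
aux_r_squared = {name: 1.0 - tol for name, tol in tolerance.items()}
mean_vif = sum(vif.values()) / len(vif)

for name in vif:
    print(f"{name}: VIF = {vif[name]:.2f}, auxiliary R^2 = {aux_r_squared[name]:.3f}")
print(f"Mean VIF = {mean_vif:.2f}")
```

Each predictor shares less than a quarter of its variance with the other two, which is why all VIFs sit far below the threshold of 10 used here.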
Estimate principal components.
3 . pca severity std_log_affected std_log_duration
Principal components/correlation Number of obs = 4312
Number of comp. = 3
Trace = 3
Rotation: (unrotated = principal) Rho = 1.0000
--------------------------------------------------------------------------
Component | Eigenvalue Difference Proportion Cumulative
-------------+------------------------------------------------------------
Comp1 | 1.58649 .701889 0.5288 0.5288
Comp2 | .884599 .355685 0.2949 0.8237
Comp3 | .528914 . 0.1763 1.0000
--------------------------------------------------------------------------
Principal components (eigenvectors)
----------------------------------------------------------
Variable | Comp1 Comp2 Comp3 | Unexplained
-------------+------------------------------+-------------
severity | 0.4067 0.9125 0.0443 | 0
std_log_af~d | 0.6412 -0.3196 0.6977 | 0
std_log_du~n | 0.6508 -0.2554 -0.7150 | 0
----------------------------------------------------------
Component 1 explains ~53% of the variance, so we compute the score for that component (pc1) and regress on it alone.
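The proportions in the eigenvalue table are each eigenvalue divided by the trace, and a component score is a loading-weighted sum of the inputs. The Python sketch below illustrates both. It assumes the score is formed from standardized inputs (mean 0, standard deviation 1), as is usual for a correlation-based PCA; the function name and the example observation are hypothetical.

```python
import math

# Eigenvalues copied from the PCA output above.
eigenvalues = [1.58649, 0.884599, 0.528914]
trace = sum(eigenvalues)                      # ~3, the number of variables
proportion = [ev / trace for ev in eigenvalues]

# Loadings of the first component, copied from the eigenvector table.
loadings_pc1 = {
    "severity": 0.4067,
    "std_log_affected": 0.6412,
    "std_log_duration": 0.6508,
}

def pc1_score(z):
    """Weighted sum of standardized values z (dict keyed by variable name)."""
    return sum(loadings_pc1[name] * z[name] for name in loadings_pc1)

# Eigenvectors have unit length, up to rounding in the printed table.
norm = math.sqrt(sum(w * w for w in loadings_pc1.values()))

# Hypothetical standardized observation, for illustration only.
example = {"severity": 1.0, "std_log_affected": 0.5, "std_log_duration": -0.2}

print([round(p, 4) for p in proportion])
print(round(norm, 4), round(pc1_score(example), 4))
```

All three loadings are positive, so pc1 increases with severity, affected area and duration together.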
As seen below, the model R2 is similar to that of the earlier model with separate terms for "magnitude" and "duration" (R2 ~0.18 versus ~0.19).
4 . regress std_log_displaced pc1
Source | SS df MS Number of obs = 3034
-------------+------------------------------ F( 1, 3032) = 660.36
Model | 542.436732 1 542.436732 Prob > F = 0.0000
Residual | 2490.56325 3032 .821425874 R-squared = 0.1788
-------------+------------------------------ Adj R-squared = 0.1786
Total | 3032.99998 3033 .999999994 Root MSE = .90633
------------------------------------------------------------------------------
std_log_di~d | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pc1 | .3320503 .0129215 25.70 0.000 .3067145 .3573861
_cons | -.0640488 .0166419 -3.85 0.000 -.0966794 -.0314183
------------------------------------------------------------------------------
The regression model with "pc1" appears to have an R2 equal to or higher than the regression models with "duration" alone or "magnitude" alone.
5 . regress std_log_displaced severity
Source | SS df MS Number of obs = 3034
-------------+------------------------------ F( 1, 3032) = 67.93
Model | 66.4655162 1 66.4655162 Prob > F = 0.0000
Residual | 2966.53446 3032 .978408464 R-squared = 0.0219
-------------+------------------------------ Adj R-squared = 0.0216
Total | 3032.99998 3033 .999999994 Root MSE = .98915
------------------------------------------------------------------------------
std_log_di~d | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
severity | .3714916 .0450724 8.24 0.000 .283116 .4598672
_cons | -.4636299 .0590483 -7.85 0.000 -.5794086 -.3478511
------------------------------------------------------------------------------
6 . regress std_log_displaced std_log_affected
Source | SS df MS Number of obs = 3034
-------------+------------------------------ F( 1, 3032) = 374.66
Model | 333.563804 1 333.563804 Prob > F = 0.0000
Residual | 2699.43618 3032 .890315361 R-squared = 0.1100
-------------+------------------------------ Adj R-squared = 0.1097
Total | 3032.99998 3033 .999999994 Root MSE = .94357
----------------------------------------------------------------------------------
std_log_displa~d | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------+----------------------------------------------------------------
std_log_affected | .3362418 .0173714 19.36 0.000 .3021809 .3703027
_cons | -.0289544 .0171955 -1.68 0.092 -.0626704 .0047615
----------------------------------------------------------------------------------
7 . regress std_log_displaced std_log_duration
Source | SS df MS Number of obs = 3034
-------------+------------------------------ F( 1, 3032) = 590.56
Model | 494.448928 1 494.448928 Prob > F = 0.0000
Residual | 2538.55105 3032 .837252986 R-squared = 0.1630
-------------+------------------------------ Adj R-squared = 0.1627
Total | 3032.99998 3033 .999999994 Root MSE = .91502
----------------------------------------------------------------------------------
std_log_displa~d | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------+----------------------------------------------------------------
std_log_duration | .3992275 .0164281 24.30 0.000 .3670161 .4314389
_cons | -.063597 .0168168 -3.78 0.000 -.0965705 -.0306235
----------------------------------------------------------------------------------
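To put the five models side by side, the Python sketch below recomputes each R2 from the model sum of squares and expresses it as a share of what the three-predictor model explains. The labels are illustrative; the sums of squares are copied from the regression tables above.

```python
# Model sums of squares copied from the five regression tables above.
ss_total = 3032.99998
ss_model = {
    "severity + affected + duration": 581.158345,
    "pc1": 542.436732,
    "std_log_duration": 494.448928,
    "std_log_affected": 333.563804,
    "severity": 66.4655162,
}

r_squared = {m: ss / ss_total for m, ss in ss_model.items()}

# Share of the full model's explained variation kept by each smaller model.
full = ss_model["severity + affected + duration"]
retained = {m: ss / full for m, ss in ss_model.items()}

for m in sorted(r_squared, key=r_squared.get, reverse=True):
    print(f"{m:32s} R2 = {r_squared[m]:.4f}  share of full model = {retained[m]:.1%}")
```

On these numbers pc1 alone keeps roughly 93% of the variation explained by the full model, ahead of any single original predictor.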
Check assumptions of linear regression for model with principal component alone:
No evidence of significant collinearity (VIF <10).
8 . vif
Variable | VIF 1/VIF
-------------+----------------------
pc1 | 1.00 1.000000
-------------+----------------------
Mean VIF | 1.00
Assumption of linearity appears satisfied.
For this section, we mainly use the Dartmouth Flood Observatory dataset GlobalFloodsRecord.xls, which documents flood events worldwide from 1985 until early this year, and summarize the data grouped by country. We pick five variables: flood duration in days, number of deaths, flood magnitude, severity level, and affected area in square kilometers. We then sum each variable by country to obtain cumulative flood information for every country.
The country name variable in the dataset is messy: it contains NA entries as well as misspelled country names. We correct these issues and group by country using the “dplyr” package.
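The cleaning and grouping step can be sketched as follows. This is a Python illustration of the idea, not the dplyr code used for the report; the records, the field names and the correction table are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records; the real data come from GlobalFloodsRecord.xls.
records = [
    {"country": "Phillipines", "duration": 5, "dead": 12},
    {"country": " Philippines ", "duration": 3, "dead": 4},
    {"country": None, "duration": 2, "dead": 0},
    {"country": "NA", "duration": 1, "dead": 0},
    {"country": "UK", "duration": 4, "dead": 1},
    {"country": "United Kingdom", "duration": 6, "dead": 2},
]

# Hypothetical correction table mapping known variants to one spelling.
corrections = {"phillipines": "Philippines", "uk": "United Kingdom"}

def clean_country(raw):
    """Return a normalized country name, or None for missing entries."""
    if raw is None or not raw.strip() or raw.strip().upper() == "NA":
        return None
    name = " ".join(raw.split())            # trim and collapse whitespace
    return corrections.get(name.lower(), name)

# Sum each numeric field per cleaned country name.
totals = defaultdict(lambda: {"duration": 0, "dead": 0})
for row in records:
    country = clean_country(row["country"])
    if country is None:
        continue                            # drop rows without a usable country
    for field in ("duration", "dead"):
        totals[country][field] += row[field]

print(dict(totals))
```

Rows with a missing country are dropped here rather than assigned to a catch-all group; that is a choice of this sketch, not something the report states.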
We now have a dataset in which each row is a country and each column is a feature (total duration in that country, cumulative flood severity, etc.). We obtain each country's longitude and latitude with the geocode() function from the “ggmap” package and append them to each row.
The first plot is as follows:
This plot displays each country as a circle on the world map. The size of the circle marks the total number of deaths during floods in that country (bigger means more deaths), and its color indicates the total duration of all floods in that country in days (darker means longer). The darker (i.e., redder) circles are clearly bigger, which suggests an intuitive relationship: the longer the floods lasted, the more people died.
The second plot explores the relationship between the cumulative severity of all floods and the total affected area in square kilometers:
As we would expect intuitively, the darker (redder) circles are clearly bigger, which shows a direct relationship: the more area the floods affected in a country, the more severe those floods were.
The third plot helps us verify our intuition that the longer the floods lasted in a country, the more severe those floods were:
The next plot investigates the relationship between the cumulative severity of the floods in a country and the total number of deaths during those floods:
It turns out that these two features are also directly related.
Lastly, we ask whether the same kind of relationship exists between the cumulative magnitude of floods and the number of deaths during floods in a country. Since magnitude is calculated by the formula \(\text{magnitude}=\log(\text{duration}\cdot \text{severity}\cdot \text{total affected area}),\) and our plots show direct pairwise relationships among the three variables used to calculate the magnitude and the number of deaths, we would expect magnitude and the number of deaths to be directly related as well. We confirm this with the following plot:
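Because the logarithm of a product is the sum of the logarithms, magnitude rises whenever duration, severity or affected area rises, which is why it should track the same outcomes as its three inputs. A minimal Python sketch of the formula follows; the base-10 logarithm is an assumption (the formula above only says "log"), and the input values are hypothetical.

```python
import math

def flood_magnitude(duration_days, severity, affected_km2):
    """log(duration * severity * affected area); base 10 is assumed here."""
    return math.log10(duration_days * severity * affected_km2)

# Hypothetical flood: 10 days, severity class 2, 5000 square kilometers.
magnitude = flood_magnitude(10, 2, 5000)

# The same value written as a sum of the three log terms.
as_sum = math.log10(10) + math.log10(2) + math.log10(5000)

print(round(magnitude, 6), round(as_sum, 6))
```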
As a second part, the Global Flood Record dataset was collapsed into a per-continent view to summarize the most relevant predictors by continent:
We decided to subset the data and focus on five flood events that occurred in the United Kingdom during 1990. We chose these events after slicing the Global Floods data over time and generating an animation that allowed us to pick out large, isolated flood events. The interactive animation is shown below:
Once the entries for the 1990 UK floods had been located in the Global Flood Record, the next step was to locate that information within the NOAA daily phi database. To do this, the Global Flood Record dates were transformed into days since January 1st, 1948, so that the two tables could be matched correctly. The transformed values look like this:
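Converting a calendar date to a day offset is a date subtraction against a chosen origin, and both tables must use the same origin for the match to work. The Python sketch below keeps the origin as a parameter; the function and constant names are hypothetical. As a check on the arithmetic, an offset of 7330 corresponds to 1990-01-26 when counted from 1970-01-01 (the default origin of R's Date class), whereas the same date counted from 1948-01-01 gives a larger offset.

```python
from datetime import date

def days_since(day, origin):
    """Number of whole days between origin and day."""
    return (day - origin).days

ORIGIN_1948 = date(1948, 1, 1)   # origin named in the text
ORIGIN_1970 = date(1970, 1, 1)   # default origin of R's Date class

example = date(1990, 1, 26)      # illustrative date

offset_1970 = days_since(example, ORIGIN_1970)
offset_1948 = days_since(example, ORIGIN_1948)

# Offsets under the two origins differ by a constant shift.
shift = days_since(ORIGIN_1970, ORIGIN_1948)

print(offset_1970, offset_1948, shift)
```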
## Began Ended
## 3837 7663 7666
## 3845 7605 7607
## 3856 7583 7584
## 3924 7361 7363
## 3930 7330 7345
Once the phi data for the days of each flood event had been located, we followed this protocol to build five relevant data matrices, one per flood:
Each dataset consists of a ribbon subset of the phi data that spans all longitudes (to preserve the direction-shifting pattern of pressure waves) but only latitudes between 45N and 60N. This constitutes a grid of 7x144=1008 cells for every day. To keep our datasets as matrices, we used the as.vector() function to collapse the 1008 grid cells for each day into a single vector.
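The reshaping described above can be sketched in Python as follows. The 2.5-degree grid spacing is an assumption chosen because it reproduces the 7x144 cell count; the placeholder values and the function name are hypothetical, and this is not the R code used for the report.

```python
# Assumption: a 2.5-degree grid, which yields 7 latitudes and 144 longitudes.
lats = [45 + 2.5 * i for i in range(7)]       # 45N, 47.5N, ..., 60N
lons = [2.5 * j for j in range(144)]          # 0E, 2.5E, ..., 357.5E

def flatten_day(grid):
    """Collapse one day's 7x144 grid (list of latitude rows) into 1008 values."""
    # Row-major order; R's as.vector() walks a matrix column by column instead.
    return [value for row in grid for value in row]

# Placeholder values for two days, for illustration only.
days = [
    [[float(d * 10000 + i * len(lons) + j) for j in range(len(lons))]
     for i in range(len(lats))]
    for d in range(2)
]

# One row per day, one column per grid cell.
matrix = [flatten_day(grid) for grid in days]

print(len(matrix), len(matrix[0]))
```

Whatever ordering is used to flatten the grid, it has to be the same for every day so that each column always refers to the same cell.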
Since princomp() in R cannot handle matrices with very high dimensionality, and we also wanted to capture the varying effect of pressure levels over time in several time dimensions, we decided to subset observations in the following way:
Hence, the 5 final matrices have between 2418 and 2964 rows, and all of them have 1008 columns (one per grid cell), as detailed below:
## [1] 2496 1008
## [1] 2457 1008
## [1] 2418 1008
## [1] 2457 1008
## [1] 2964 1008
Once we had the slices for each flood event, we ran PCA on them. The plots below show the variance explained by each component:
It can be seen that floods #1, #4 and #5 have at least 6 significant principal components, whereas floods #2 and #3 have only 2 very significant components. All of them, however, show a considerable gap in explained variance between the first component and the rest.